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Factorization 
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1981  sec. 
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230K  3D  Finite  Elements 
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Automotive  Hood  Inner  Panel 
Springback  using  LS-DYNA 
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“Hood”  Elimination  Tree 
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Each  frontal  matrix’s  triangle  scaled 
by  operations  required  to  factor  it. 
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Concurrency  within  frontal  matrices 

Small  P  =>  column  wrap 

Large P => 2D (à la LINPACK benchmark)
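The two layouts named above can be sketched as owner functions. This is a minimal illustration, not the mapping used in LS-DYNA: the function names, the block size `nb`, and the `pr` x `pc` process grid (with P = pr * pc) are assumptions introduced here.

```c
#include <assert.h>

/* Column wrap (1D cyclic): column j of the frontal matrix is owned
   by processor j mod P. */
int owner_column_wrap(int j, int nprocs)
{
    assert(j >= 0 && nprocs > 0);
    return j % nprocs;
}

/* 2D block-cyclic: the matrix is cut into nb x nb blocks and block
   (i / nb, j / nb) is dealt cyclically onto a pr x pc process grid.
   Returns the row-major rank of the owning grid position. */
int owner_2d_block_cyclic(int i, int j, int nb, int pr, int pc)
{
    assert(i >= 0 && j >= 0 && nb > 0 && pr > 0 && pc > 0);
    int grid_row = (i / nb) % pr;
    int grid_col = (j / nb) % pc;
    return grid_row * pc + grid_col;
}
```

With column wrap every processor touches every row of the front, which is cheap to set up for small P. The 2D layout spreads both rows and columns, so each processor exchanges data with fewer partners as P grows.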

Concurrency  across  elimination  tree 

Frontal  matrices  only  dependent  on  children 
“Subtree-subcube” typically used
Limits  communication 
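Because a frontal matrix depends only on its children, supernodes at the same depth of the elimination tree can be factored concurrently. A small sketch of computing those levels follows; the parent-array representation, the postorder numbering, and the name `tree_levels` are assumptions made for illustration.

```c
#include <assert.h>

/* parent[v] is the parent supernode of v, or -1 for a root.
   Assumes a postordering, so parent[v] > v for every non-root v.
   On return level[v] is 1 for a root and grows toward the leaves. */
void tree_levels(const int *parent, int n, int *level)
{
    for (int v = n - 1; v >= 0; v--) {
        if (parent[v] < 0) {
            level[v] = 1;                    /* root of a tree */
        } else {
            assert(parent[v] > v);           /* postorder assumption */
            level[v] = level[parent[v]] + 1; /* one level below parent */
        }
    }
}
```

Two supernodes with the same level are never ancestor and descendant, so neither waits on the other. A subtree-subcube style assignment would then hand disjoint subtrees to disjoint processor groups.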
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Level  1 


Level  2 


Level  3 


Shared  Memory  Concurrency 
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Ubiquitous,  cheap,  high  performance! 

[Figure: GFLOPS by year, 2003 to 2006. Courtesy NVIDIA]

GPU  Architecture 
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Multiple  SIMD  cores 

Multithreaded 

O(1000) per GPU

Banked shared memory
16 Kbytes C1060
48 Kbytes C2050

Simple  thread  model 

Only sync at host
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A set of SIMD multiprocessors with on-chip shared memory.

Figure 3-1. Hardware Model. Courtesy NVIDIA
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! Eliminate the off-diagonal panel: rows jr+1..ld, columns jl..jr
do j = jl, jr
   do i = jr + 1, ld
      x = 0.0
      do k = jl, j - 1
         x = x + s(i, k) * s(k, j)
      end do
      s(i, j) = s(i, j) - x
   end do
end do


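// Stage the strictly upper triangle of the diagonal block in shared
// memory, packed column by column.
ip = 0;
for (j = jl; j <= jr; j++) {
    if (ltid <= (j - 1) - jl) {
        gpulskj(ip + ltid) = s[IDXS(jl + ltid, j)];
    }
    ip = ip + (j - 1) - jl + 1;
}
__syncthreads();

// Each thread eliminates one row i of the off-diagonal panel at a time.
for (i = jr + 1 + tid; i <= ld; i += GPUL__THREAD_COUNT) {
    // Load row i of the panel into shared memory.
    for (j = jl; j <= jr; j++) {
        gpuls(j - jl, ltid) = s[IDXS(i, j)];
    }
    ip = 0;
    for (j = jl; j <= jr; j++) {
        x = 0.0f;
        for (k = jl; k <= (j - 1); k++) {
            x = x + gpuls(k - jl, ltid) * gpulskj(ip);
            ip = ip + 1;
        }
        gpuls(j - jl, ltid) -= x;
    }
    // Write the eliminated row back to global memory.
    for (j = jl; j <= jr; j++) {
        s[IDXS(i, j)] = gpuls(j - jl, ltid);
    }
}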
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Assemble  frontal  matrix  on 
host  CPU 

Initialize  by  sending  panel 
of  assembled  frontal  matrix 

Only  large  frontal  matrices 
due  to  high  cost  of  sending 
data  to  and  from  GPU 
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Factor  diagonal  block 

Note: host is faster, but it's better
to avoid data transfer
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Eliminate  off-diagonal  panel 
Earlier  CUDA  code 
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Fill  Upper  Triangle 
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Update  Schur  Complement 
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Update  panels  with  DGEMM 

DGEMM  is  extremely  fast! 

We’ve observed >100 GFlop/s
Tesla C2050 (i4r8)
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Update  Schur  Complement 
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Wider  panels  in  Schur 
complement 

DGEMM  is  even  faster 
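For reference, the Schur complement update that DGEMM performs here is C := C - A * B, that is, a GEMM with alpha = -1 and beta = 1. The plain C version below is only a sketch for checking results on small cases; the names `schur_update` and `IDX` are illustrative, and column-major storage is assumed to match the Fortran indexing used earlier.

```c
#include <assert.h>

/* Column-major element (i, j) with leading dimension ld. */
#define IDX(i, j, ld) ((i) + (j) * (ld))

/* c (m x n) -= a (m x k) * b (k x n), all stored column-major.
   a is the eliminated off-diagonal panel, b the matching rows above
   the Schur complement, c the Schur complement block. */
void schur_update(int m, int n, int k,
                  const double *a, int lda,
                  const double *b, int ldb,
                  double *c, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; j++) {
        for (int p = 0; p < k; p++) {
            double bpj = b[IDX(p, j, ldb)];
            for (int i = 0; i < m; i++) {
                c[IDX(i, j, ldc)] -= a[IDX(i, p, lda)] * bpj;
            }
        }
    }
}
```

A wider panel means a larger k per call, so the same update takes fewer, larger matrix products, which is the regime where a tuned DGEMM runs closest to peak.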
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Return  Entire  Frontal  Matrix 
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Return  error  if  diagonal  of 
0.0  encountered  or  pivot 
threshold  exceeded 

Otherwise  complete  frontal 
matrix  is  returned 

Schur  complement  added  to 
initial  values  on  host  CPU 
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Factoring  a  Frontal  Matrix 

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Method Name                       GPU msec    %GPU time
Copy data to and from GPU            201.0        32.9%
Factor 32x32 diagonal blocks          42.6         7.0%
Eliminate off-diagonal panels         37.0         6.1%
Update with SGEMM                    330.6        54.1%
Total time                           611.4       100.0%
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Calibrating  Expectations 
Dense  Kernel  Performance 
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Intel  Nehalem  Host 

2  sockets  *  4  cores  *  {4,2}  ALUs  *  2.6  GHz 
We get ~80 GFlop/s (r4) and 53 GFlop/s (r8)


NVIDIA Tesla C1060

30 processors * {8,1} ALUs * 1.3 GHz
We get 170 GFlop/s (r4)


NVIDIA Tesla C2050 (aka Fermi)

28 processors * {16,8} ALUs * 1.15 GHz
We get 97 GFlop/s (r8)
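The products listed above can be evaluated directly. The helper below is a sketch that takes the slide's factors at face value and counts one operation per ALU per cycle; the function name and argument split are assumptions made here.

```c
#include <assert.h>

/* Product of the factors as written on the slide:
   units * cores per unit * ALUs per core * clock in GHz.
   One operation per ALU per cycle is assumed. */
double peak_gflops(int units, int cores_per_unit, int alus_per_core,
                   double ghz)
{
    assert(units > 0 && cores_per_unit > 0 && alus_per_core > 0);
    assert(ghz > 0.0);
    return (double)units * cores_per_unit * alus_per_core * ghz;
}
```

For the Nehalem host, `peak_gflops(2, 4, 4, 2.6)` gives 83.2 for r4 and `peak_gflops(2, 4, 2, 2.6)` gives 41.6 for r8. For the C1060, `peak_gflops(30, 1, 8, 1.3)` gives 312 for r4, and for the C2050, `peak_gflops(28, 1, 8, 1.15)` gives 257.6 for r8. The measured 53 GFlop/s (r8) on the host is above its 41.6 product, so the listed ALU counts presumably leave something out, such as a multiply and an add issuing together; the slides do not say.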
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Kernel  Performance  (i4r8) 
C2050  vs  8  Nehalem  Cores 
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Upper  GPU,  lower  CPU  -  red  means  GPU  is  faster 


Kernel performance in GFlop/s

Update Order / Degree     1024    2048    3072    4096
 512    GPU                N/A    23.5    32.3    42.0
        CPU               22.8    47.0    49.9    51.5
1024    GPU               22.3    42.5    57.0    66.7
        CPU               43.2    48.1    50.5    51.8
1536    GPU               36.2    55.5    68.8    77.3
        CPU               42.2    49.0    49.9    52.0
2048    GPU               47.9    66.6    78.2    86.1
        CPU               46.8    49.8    51.2    52.2
2560    GPU               57.0    73.9    83.6    91.5
        CPU               48.0    50.3    51.5    52.0
3072    GPU               65.6    80.1    89.0    97.4
        CPU               49.0    50.8    51.4    52.6
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Handful  of  large  supernodes  near  the  root  of  the  tree 
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Total  time 
    2057 sec.
Linear solver
    1995 sec. (97%)
Factorization
    1981 sec. (96%)
Suitable for GPU?
    88%
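The timings above bound what GPU offload can deliver end to end. As a rough aid, Amdahl's law (brought in here for illustration, it is not on the slides) says that if a fraction f of run time is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The function name below is hypothetical.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction of the run time
   is accelerated by kernel_speedup and the rest is unchanged. */
double amdahl_speedup(double fraction, double kernel_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup);
}
```

With f = 0.88, doubling the speed of the GPU-suitable part gives `amdahl_speedup(0.88, 2.0)`, about 1.79 overall, and no kernel speedup can push the total past 1 / 0.12, about 8.3.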
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AWE  benchmark 
230K  3D  Finite  Elements 
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Number  of  Supernodes  & 
Factor Operations in Tree
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[Figure: Number of Supernodes and Factor Operations per Level]
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Multicore  Performance  (i4r4) 
vs. the Elimination Tree
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[Figure: Multicore Performance per Level (CPU Only); x-axis: Level]
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LS-DYNA  Implicit 
CPU vs. CPU & GPU (i8r8)
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LS-DYNA on Outer3 (End to End)
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Near-term  Future 
Bigger  Problems 
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•  Problems  that  don’t  fit  in  GPU  memory 

•  Out-of-core  to  host  memory? 

•  Performance  Optimization 

•  Better  NVIDIA  libraries 

•  Re-optimize  our  CUDA  kernel 

•  Overlap  computation  &  communication 

•  Pivoting  for  numerical  stability 

•  Distributed  memory  (e.g.,  MPI) 

• One GPU per Supernode

• Kernel with MPI and GPUs
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CUBLAS  3.2  based  on  UTK’s  MAGMA 
We’ve  seen: 

SGEMM  398  Gflop/s 
DGEMM  231  Gflop/s 
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Longer-term  Future 
Smaller  Problems 


ISI

Information  Sciences  Institute 


•  Factor  smaller  frontal  matrices  on  GPU 

•  Maintain  real  stack  on  GPU 

•  Assemble  initial  values  on  GPU 

•  If  the  entire  matrix  fits  on  the  GPU 

•  Forward  and  back  solves 

•  Exploit  GDRAM  memory  B/W 
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Factoring  large  frontal  matrices  on  Nvidia  C2050 

Sped  up  LS-DYNA  implicit 
Another  factor  of  2X  likely 
Explicit  will  be  much  harder 

Similar  results  for  other  implicit  MCAE  codes 
BCSLIB-GPU  too 

ISVs slow to come to market
Modest  speedup 
Support  and  pricing  issues 
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